CS 224 N: Natural Language Processing

Author

  • David Kale
Abstract

The objective of this project is to analyze the performance of a class-based language model and compare it to that of traditional n-gram language models. Class-based language models are well studied, as is the use of clustering to learn classes of words. However, it seems fairly standard across the literature to use hard clustering, i.e., to assign each word to a single class, and then to use these classes in a class n-gram language model. Also, word clustering often seems to be done in conjunction with document clustering, allowing document class to be used in determining word class and vice versa. We hoped to do something a bit different from what appeared to be the standard approach. We by no means think our approach is novel, but we found relatively little literature whose work we seemed to be repeating. First and foremost, we clustered words entirely on part-of-speech information. Secondly, we believe that a single word's class can vary according to its context, suggesting a soft-membership clustering approach that would give a probability distribution over classes given a word and its context. Finally, since word class is a hidden variable, with the actual text representing the observables, we have attempted to construct HMM-based language models. Ultimately, we found that our learned hidden-state class-based language model performed significantly worse than traditional n-gram word language models. We have ideas about how our approach might be improved with future research, but we are not convinced that such work is really warranted.
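As a rough illustration of the idea of treating word class as a hidden variable, the following minimal sketch scores a sentence with a first-order HMM and reads off a soft class membership for each word given its left context. It is not the author's implementation: the abstract does not give the model's parameterization, so the two classes, the three-word vocabulary, and every probability below are hypothetical values chosen only to make the example run.

```python
from typing import Dict, List

# All parameters below are hypothetical toy values, not taken from the paper.
CLASSES = ["NOUN", "VERB"]
INITIAL = {"NOUN": 0.6, "VERB": 0.4}  # P(c_1)
TRANSITION = {  # P(c_t | c_{t-1})
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
EMISSION = {  # P(w_t | c_t)
    "NOUN": {"dogs": 0.5, "run": 0.2, "fish": 0.3},
    "VERB": {"dogs": 0.1, "run": 0.6, "fish": 0.3},
}


def forward(words: List[str]) -> List[Dict[str, float]]:
    """Forward algorithm: alphas[t][c] = P(w_1..w_t, c_t = c)."""
    alphas: List[Dict[str, float]] = []
    for t, word in enumerate(words):
        alpha = {}
        for c in CLASSES:
            if t == 0:
                prior = INITIAL[c]
            else:
                # Sum over every class the previous word could have had.
                prior = sum(alphas[-1][p] * TRANSITION[p][c] for p in CLASSES)
            # Unknown words get probability 0 in this unsmoothed sketch.
            alpha[c] = prior * EMISSION[c].get(word, 0.0)
        alphas.append(alpha)
    return alphas


def sentence_probability(words: List[str]) -> float:
    """P(w_1..w_n), marginalizing over all hidden class sequences."""
    if not words:
        return 1.0
    return sum(forward(words)[-1].values())


def class_posteriors(words: List[str]) -> List[Dict[str, float]]:
    """Soft membership P(c_t | w_1..w_t) for each position (left context only)."""
    posteriors = []
    for alpha in forward(words):
        total = sum(alpha.values())
        if total == 0.0:
            # The prefix is impossible under the model; no posterior is defined.
            posteriors.append({c: 0.0 for c in CLASSES})
        else:
            posteriors.append({c: alpha[c] / total for c in CLASSES})
    return posteriors
```

With these toy numbers, `class_posteriors(["dogs", "fish"])` and `class_posteriors(["run", "fish"])` assign different class distributions to "fish", which is the context-dependent soft membership the abstract describes. A hard-clustering model would instead fix one class per word regardless of context.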


Similar resources

CS 224 N - Natural Language Processing Final Project Modification Identification in Recipe

Recipes are commonly modified for health, taste, or ingredient-availability reasons. A summary of modifications by other users of a recipe is useful in determining whether a proposed modification will be successful. We present a system for identifying modifications in comments posted on recipe websites. As a secondary goal, our system design process demonstrates how we can combine human understand...


Integrating Cognitive Simulation into the Maryland Virtual Patient

This paper briefly describes four cognitively-related aspects of modeling a virtual patient: interoception, decision-making, natural language processing and learning. These phenomena are treated within the Maryland Virtual Patient simulation and training environment.


Memo CS – 03 – 09

This paper concerns infrastructural work in the fields of Language Engineering, Natural Language Processing and Computational Linguistics. We begin by defining the area of software support for research and development of components in these areas as Software Architecture for Language Engineering (SALE). The rest of the paper reviews contributions to this field, covering a wide range of work ove...


Metonymy and metaphor: what's the difference

The three main features of the computational approach are that: (a) literalness, metaphor, and anomaly share common features and form a group distinct from metonymy which has characteristics that requires a quite different treatment; (b) chains of metonymies occur, supporting an observation by Reddy (1979); and (c) metonymies can co-occur with instances of either literalness, metaphor, or anom...


PyPLN: a Distributed Platform for Natural Language Processing

This paper presents PyPLN, a distributed platform for Natural Language Processing. PyPLN leverages a vast array of NLP and text processing open source tools, managing the distribution of the workload on a variety of configurations: from a single server to a cluster of Linux servers. PyPLN is developed purely in Python but makes it very easy to incorporate other software for specific tasks, as l...



Publication date: 2005